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Abstract. Recently much attention has been devoted to the construction of phyloge- 
netic networks which generalize phylogenetic trees in order to accommodate complex evo- 
lutionary processes. Here we present an efficient, practical algorithm for reconstructing 
level-1 phylogenetic networks - a type of network slightly more general than a phyloge- 
netic tree - from triplets. Our algorithm has been made publicly available as the program 
LevIathan. It combines ideas from several known theoretical algorithms for phylogenetic 
tree and network reconstruction with two novel subroutines. Namely, an exponential-time 
exact and a greedy algorithm both of which are of independent theoretical interest. Most 
importantly, LevIathan runs in polynomial time and always constructs a level-1 net- 
work. If the data is consistent with a phylogenetic tree, then the algorithm constructs 
such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent 
with some level-1 network, it will find such a network. The potential of LevIathan is 
explored by means of an extensive simulation study and a biological data set. One of our 
conclusions is that LevIathan is able to construct networks consistent with a high per- 
centage of input triplets, even when these input triplets are affected by a low to moderate 
level of noise. 

1. Introduction 

Phylogenetic networks such as the one depicted in Figure [U provide a natural and pow- 
erful extension of the concept of a phylogenetic tree (see Section [2] for precise definitions 
of these two concepts as well as the other concepts used in this paper) to accommodate 
complex evolutionary processes such as hybridization, recombination or horizontal gene 
transfer. Consequently, their attractiveness to evolutionary biology as a model for repre- 
senting the evolutionary past of a set of taxa (e.g. species represented by gene or genetic 
marker sequences) whose evolution might have been driven by such processes has grown 
over the years. This in turn has generated much interest in these structures from math- 
ematicians and computer scientists working in phylogenetics [9j [131 HS1 [22]. The desire 
of biologists to use ever-longer sequences, combined with the computational complexities 
involved in dealing with such sequences, has meant that much research in mathematical 
methodology and algorithm development has focused on developing indirect methods for 
phylogenetic reconstruction. This has spawned the thriving area of supertree construction 
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[2] which aims to reconstruct a phylogenetic tree by puzzling it together from smaller trees. 
Inspired by this - and guided by the fundamental observations that (i) a phylogenetic tree 
is uniquely described by the set of triplets (i.e. rooted binary phylogenetic trees on three 
leaves) it induces (see e.g. [24]) and (ii) that a phylogenetic network is a generalization 
of a phylogenetic tree, two main triplet-based approaches for phylogenetic network recon- 
struction immediately suggest themselves. One is to essentially first employ a method such 
as TreePuzzle (see e.g. [22]) to generate a set of quartet^ from a sequence alignment, to 
then derive a set T of triplets from that set and then to use the set T to reconstruct a 
phylogenetic network. The other is to essentially first reconstruct phylogenetic trees on 
subsections of the given alignment (that might reflect different evolutionary scenarios) and 
then to reconstruct a phylogenetic network from the set of triplets induced by those trees. 

For any set T of triplets, a phylogenetic network that, in a well-defined sense, is con- 
sistent with all triplets in T can easily be constructed if the complexity of the network 
is essentially unbounded [T7J [3T] • Such networks are however only of marginal biological 
relevance. Researchers have therefore turned their attention to studying restricted classes 
of phylogenetic networks. One such class is that of a level-1 network or a galled tree i.e. a 
phylogenetic network in which essentially any two cycles are vertex disjoint (see Figure [1] 
for an example). Such networks are of practical interest because (i) they are relatively 
treelike, and (ii) their simple structure suggests the possible existence of fast algorithms 
to construct them. Indeed, Jansson, Sung and Nguyen showed that it can be decided in 
polynomial time whether there exists a level-1 network with leaf set X consistent with all 
input triplets [ISldT], if the input triplet set is dense, i.e. if a triplet is given for each com- 
bination of three taxa in X. Their algorithm will also construct such a network, if it exists. 
However, a level-1 network consistent with all input triplets might not exist for several 
reasons. Firstly, the real evolutionary history might be too complex to be described by a 
level-1 network. Secondly, some of the input triplets might be incorrect (which is likely to 
be the case in practice). 



The analogue of a triplet within the area of unrooted phylogenetic tree reconstruction. 



A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks 



3 



One response to this problem has been to increase the complexity of the networks that 
can be modelled. For example, it has been shown that, for each fixed non-negative inte- 
ger k, the problem of constructing a level-k network consistent with a dense set of input 
triplets is polynomial-time solvable [26, 28J. The higher the level, the higher the complex- 
ity of evolutionary scenarios that can be represented. However, the running time grows 
exponentially with k and initial experiments with the related heuristic Simplistic [301 EH] 
show that, since these algorithms insist on full consistency with the input triplets, only a 
small amount of noise is required in the input data to artificially inflate the level of the 
produced network. This causes an undesirable increase in both running time and network 
complexity. 

A second strategy is to place a ceiling on the complexity of the networks that can be 
constructed and to no longer demand full triplet consistency. Implemented in the program 
LevIathan [12], this paper adopts this second strategy and presents the first heuristic 
algorithm for constructing level-1 networks from triplets. Given any set of triplets, our 
heuristic always constructs a level-1 network M in polynomial time, which is in practice a 
great advantage in light of the algorithmic results mentioned above. Moreover, it attempts 
to construct M such that it is consistent with as many of the input triplets as possible. If 
a weight is given for each triplet reflecting for example some kind of confidence level one 
might have in that triplet, then our heuristic aims to maximize the total weight of the 
input triplets consistent with N ' . Optimizing such functions is NP-hard [3"lj [32] , even for 
the case of determining a tree consistent with the maximum number of triplets from an 
unweighted, dense triplet set. Having said that, an exponential-time exact algorithm was 
proposed in [31] that always finds an optimal solution, but this algorithm is only practical 
for small numbers of taxa. In addition, polynomial-time approximation algorithms have 
been formulated for level-1 network reconstruction. For example it was first shown in 
[16] how to construct in polynomial time such a network consistent with 5/12 « 0.42 of 
the input triplets. This was subsequently improved to 0.488 ... in [4], which is worst-case 
optimal. Both algorithms are mathematically interesting, but have the drawback that they 
produce networks with a highly rigid topology which are biologically unrealistic. 

LevIathan, which we outline in Section[3J combines elements from the above mentioned 
approaches into an algorithm with a strong recursive element. In addition to its polynomial 
running time, LevIathan enjoys the following desirable properties. If a set of input 
triplets is consistent with a tree, or if at any stage in a recursion such a triplet set T 
occurs, a phylogenetic tree will be generated from T. Similarly, whenever the triplet set 
is dense and fully consistent with some level-1 network, such a network will be produced. 
Both outcomes are a direct consequence of a partitioning strategy that we describe in 
Section [H If a network is produced that (when ignoring directions on the arcs) contains 
cycles of moderate size, then, due to a novel exponential-time exact algorithm which we 
describe in Section the topology of each of these cycles is locally optimal. This algorithm 
is complemented by a novel greedy algorithm to construct larger cycles which we also 
describe in that section. In addition, the output network is guaranteed to be consistent 
with at least 5/12 of the input triplets, if one uses the LevIathan option "blocks 3" 
which has its origin in the above mentioned partitioning strategy. In Section [6] we test 
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LevIathan on synthetic and real biological data; to facilitate this we also develop a novel 
level- 1 network generation method and discuss several measures for network comparison. 
Taken as a whole the results of the experiments are promising, suggesting that LevIathan 
will be genuinely useful in a real- world context, but do also highlight some limitations of 
triplet-based approaches and level- 1 networks as a model of evolution. 

We conclude this section with remarking that although LevIathan can be applied to 
both weighted and unweighted triplet sets for clarity we will restrict our exposition to 
unweighted triplet sets. 

2. Basic Concepts 

We start with some concepts from graph theory concerning directed acyclic graphs. 
Suppose G = (V, A) is a directed acyclic graph, or DAG for short. Then G is called 
connected (also called "weakly connected") if, when ignoring the directions on the arcs, 
there is a path between any two vertices of G. Moreover, G is called biconnected if G 
contains no vertex whose removal disconnects G. A biconnected subgraph if of a graph G 
is said to be a biconnected component if there is no biconnected subgraph H' ^ H of G 
that contains H. A biconnected component is said to be nontrivial if it contains at least 
three vertices. A vertex v of G is called an ancestor of a vertex v' if there is a directed 
path from v to v' . In this case, v' is also called a descendant of v. Now suppose that v\ 
and v 2 are two distinct vertices of G. Then a vertex v £ V is a lowest common ancestor of 
V\ and i>2 if v is an ancestor of both V\ and v 2 and no descendant of v is also an ancestor 
of V\ and v%. 

A (phylogenetic) network Af on some finite set X of taxa is a DAG with a single root (a 
vertex with indegree 0), whose leaves (vertices with outdegree 0) are bijectively labelled by 
the elements of X. Following common practice, we identify each leaf with its label. Thus, 
the leaf set of Af, denoted L(Af), is X. Vertices with indegree 1 are called split vertices and 
vertices with indegree at least two are called reticulations. A network Af is called semi- 
binary if each reticulation has indegree 2 and binary if in addition each reticulation has 
outdegree 1 and each split vertex outdegree 2. We will restrict our exposition to binary 
networks, but remark that LevIathan uses post-processing to simplify a constructed 
network by collapsing arcs while making sure that triplet consistency is preserved. This 
can create vertices with outdegree greater than two. 

A semi-binary network Af is said to be a level-k network, if each biconnected component 
contains at most k reticulations. Clearly, if k — then this implies that Af is a phylogenetic 
tree in the usual sense (see e.g. [24J). It is not difficult to verify that for k — 1, a nontrivial 
biconnected component of a level- A; network always has the form of two directed paths that 
both start at some vertex u and both end at some vertex v 7^ u but which are otherwise 
vertex disjoint. Such a biconnected component is often called a reticulation cycle or a gall. 
We define the size of a gall to be the number of vertices in it i.e. the number of outgoing arcs 
plus one. The network in Figure 0(a) contains a single gall of size 7, for example. Let Af 
be a phylogenetic network. An arc a of Af is said to be a cut-arc if removing a disconnects 
Af . A cut-arc is called trivial if its head is a leaf. A level-A; network is said to be a simple 
level- A; network, if Af contains no nontrivial cut-arcs and is not a level- (A; — 1) network. 
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Thus, a simple level-1 network can be thought of as a single gall with some leaves attached 
to it, as in Figure [2£a). Simple level-1 networks are the basic, recursive building blocks of 
level-1 networks, in the sense that each gall of a level-1 network essentially corresponds to 
a simple level-1 network. 

A (rooted) triplet xy\z is a phylogenetic tree on {x,y,z} such that the lowest common 
ancestor of x and y is a proper descendant of the lowest common ancestor of x and z, see 
Figure [2](b). For convenience, we call a set of triplets a triplet set. A triplet xy\z is said to 
be consistent with a network Af (interchangeably: Af is consistent with xy\z) if Af contains 
a subdivision of xy\z, i.e. if Af contains distinct vertices u and v and pairwise internally 
vertex-disjoint paths w — > x, u ^ y, v ^ u and v — > z. For example, the network in 
Figure [2]^a) is consistent with (among others) the triplets ab\c,af\c, ef\d and de\f but is 
not consistent with (among others) the triplets be\a, cd\e and fa\b. We say that a triplet 
set T is consistent with Af if all triplets in T are consistent with Af and denote the set of 
all triplets on L{Af) consistent with Af by T(Af). 



Figure 2. (a) Example of a simple level-1 network on X = {a, ...,/} and 
(b) the triplet xy\z. 

We note that, whenever a network Af contains a gall C of size 3, C can be replaced by a 
single split node to obtain the network Af' where T(Af') = T(Af). In such cases we prefer 
the latter construction to the gall because it is a more parsimonious response to the input 
data. We henceforth sharpen the above definition of a level-1 network to exclude galls of 
size 3. (Galls of size 2 are already excluded because we do not allow multiple arcs). 



In this section with present a rough outline of our 4 phase program LevIathan which 
is implemented in Java and freely available for download [12]. It takes as input a set T of 
triplets or, more generally, a set T of phylogenetic trees (e.g. gene trees) given in Newick 
format [5]. In the latter case the (weighted) union of the triplet sets induced by the trees 
in T is taken by LevIathan as triplet set T. It outputs a (possibly post-processed) 
level-1 network in DOT format [UJ and/or eNewick format [27]. The goal of LevIathan 
is to construct a level-1 network Af that maximizes the number of triplets in T that are 
consistent with it. In an optional post-processing phase the generated network can then be 
simplified by contracting all arcs (u, v) of Af where neither u nor v is a reticulation, v is not 
a leaf, such that the contraction would not cause a triplet in T that was consistent with 
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M to become not consistent with it. (Leaving such arcs uncontracted is tantamount to an 
unfounded strengthening of our hypothesis about what the "true" evolutionary scenario 
looked like). To explain the program's four phases, we require some more definitions and 
notation. To this end, suppose T is a set of triplets and define the leaf set L(T) of T to be 
the set Uter -^(0- For any subset V C L(T), we denote by T\U the set of triplets in t G T 
such that L(t) C L' . 

Algorithm 1 Basic outline of Leviathan 
Input : Triplet set T. 

Output : Level-1 network M heuristically optimizing the number of triplets consistent 
with T. 

1: Partitioning the leaf set: Find a partition C = {Li, . . . , L q } of L(T). This will be 
detailed in Section HI 

2: Gall construction: Construct a simple level-1 network M with leaf set C This will 

be detailed in Section 
3: Recursion: Recursively call LevIathan to construct, for all 1 < % < q, a level-1 

network Mi from the triplet set Tj = T\L^. 
4: Merging: Construct a network M by combining M with Mi, . . . , M q as follows. For i = 

1, . . . , q, identify leaf of M with the root of M{. 



Note that the algorithm does not backtrack in the sense that it never revises earlier made 
decisions. Also note that while the recursion and merging phases are relatively simple, 
the partition and gall construction phases are not. For these phases we designed new 
algorithms, which will be described in the following two sections. 

4. Partitioning the Leaf Set 

Based on their order of priority, we next describe the three steps that make up the 
partitioning strategy employed by LevIathan to find a suitable partitioning £ of the leaf 
set of a triplet set. To explain this strategy in detail, assume for the rest of this section 
that T is a set of triplets and put L := L(T). 

The first step of LevIathan's partitioning strategy is to determine whether an Aho 
move is possible for T. This move is based on an algorithm originally introduced by Aho 
et. al. in [1J. Following [24J where this algorithm is referred to as BUILD algorithm, this 
algorithm relies on the clustering graph GWt] induced by T on any subset V C L. For the 
convenience of the reader, we remark that for any subset L'Ci the vertex set of G[l>,t] 
is V and any subset {a, b] G ( \ ) forms an edge in G[l',t] if there exists some c G V such 
that ab\c G T. 

Aho move: Construct the clustering graph G[l,t]- If G[l,t] is not connected then use 
the vertex sets of the connected components of G\l,t\ as the blocks of the partition £. 
Otherwise, say that the Aho move is not successful. 
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Note that the Aho move is essentially the embedding of the Build algorithm inside 
Leviathan. Always starting with such a move means that, if a set of triplets is consistent 
with a phylogenetic tree, then a phylogenetic tree will be output. More generally it means 
that Leviathan will potentially produce level-1 networks with treelike regions. Also note 
that if the Aho move is successful then the Gall construction phase of Leviathan is not 
necessary: the components of the clustering graph will simply correspond to subnetworks 
that the algorithms "hangs" from a single split vertex, just like the BUILD algorithm. 
(This can lead to split vertices with more than two children, which as already mentioned 
LevIathan supports via post-processing). 

Our motivation for prioritising the Aho move is twofold. Firstly, it adheres to the 
parsimony principle: if the data can be explained equally well by both a phylogenetic tree 
and a network, choose the tree. As a non-trivial example, note that the triplet set T(JV) 
induced by the phylogenetic tree N on X = {x,y, z, g, /} depicted in Figure [3^a) is also 
consistent with the level-1 network M' on X depicted in Figure [3]^b) and that M' is in 
some sense minimal with this property, that is, no arc of N' can be deleted such that 
the triplets in T(N) remain consistent with the resulting phylogenetic tree. Secondly it 
exploits the not entirely trivial observation that, when attempting to construct a level-1 
network consistent with the maximum number of triplets in a given set, it can never be 
suboptimal to make an Aho move, when this is possible [25]. 



r 




Figure 3. The triplets in T{M) are also consistent with Af' and Af f is 
minimal with this property. 



If the Aho move is not successful, the next step is to try a JNS move. To explain this 
move we require a further concept which was originally introduced by Janson, Nguyen and 
Sung in [16]. Suppose L' C L(T). Then V is called an SN-set (ofT) if there exists no 
triplet xy\z in T such that x,z £ V and y £" V '. A SN-set S of T is called maximal if 
S 7^ L(T) and there does not exist an SN-set S' ^ L(T) such that S C S' . Also, for 
some partition C = {L\, L q } of L(T) into q blocks Lj, 1 < i < q, we define the induced 
triplet set TV£ of £ as the set of all triplets L^L^L^ on L for which there exists a triplet 
xy\z £ T where x £ L iy y £ Lj and z £ and i, j and k are all distinct. Note that TV C 
is dense if T is dense. 

JNS move: If (a) the set S of maximal SN-sets of T is a partition of L(T), and (b) 
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TVS is a dense set of triplets that is consistent with some simple level-1 network J\f s , then 
define S to be the sought after partition £ (and use A/" s as M in the Gall construction 
phase). Otherwise and as in the case of an Aho move, say that the move is not successful. 

The JNS move is essentially the level-1 analogue of the above described Aho move. The 
main difference is that although, using for example the BUILD algorithm, it is always 
possible to decide in polynomial time if a triplet set is consistent with a phylogenetic tree 
it is in general NP-hard to determine whether there is a level-1 network that is consistent 
with the set (even if we restrict to simple level-1 networks). However, the situation changes 
and becomes decidable in polynomial time if the triplet set in question is dense [16J. Density 
of the induced triplet set is therefore a necessary requirement for a successful JNS move. 
Hence, requirement (b). To motivate requirement (a), note that for dense triplet sets T 
the set S of maximal SN-sets of T always forms a partition of L(T). For general (i.e. 
non-dense) triplet sets this is not always, but sometimes, true. Requirement (a) is thus 
included to extend JNS moves to non-dense triplet sets. 

We conclude the discussion of the JNS move with remarking that this move enjoys the 
same optimality properties as an Aho move. More precisely, prioritising the JNS move 
guarantees that a level-1 network consistent with all the original triplets will be produced 
if such a network exists and the original triplet set was dense. (And, less obviously, that 
making a JNS move can never lead to a suboptimal network, assuming subsequent recursive 
steps give optimal networks). Again analogous to a successful Aho move, a successful JNS 
move allows the full generality of the Gall construction phase (see below) to be bypassed 
by utilizing the already computed simple level-1 network Af s . 

Due to noise in real biological data (or the inherent complexity of the underlying network) 
however, it is generally too much to hope for that one of the two moves described so far 
will always be successful. We therefore have adapted a strategy from [16] to obtain a 
third move which we call the Heuristic move. The heart of this move is formed by a score 
function score(T, C), CJ a partition of the leaf set of T, from [TB] plus two operations (PI) 
and (P2) that will allow us to manipulate a partition C i d with the aim of ensuring that 
the score of the new partition C new is at least as good as the score of the given partition, 
i.e. score(T, £ new ) > score(T, £ id) holds. To describe both the score function and these 
two operations, we require some more concepts. Suppose C! = {£,[,...,£,'} is a partition 
of L(T). Then to £' we associate the following four subsets 

T b ad = T bad (£', T) = ^xy\z ET x,z E L'^y E L'j where i ^ j j, 
Tgood = T good (£', T) = ^xy\z ET x,y E L'^z E L'j where i ^ j j, 



Ti OC ai = T loca i(£',T) = < xy\z E T 



x, E L' { ,y E L'-, z E L' k where i ^ j ^ k ^ i> , 



Tdefer = T defer (C , T) = j xy \ z E T x, y , z E L\ for some 1 < % < g j 
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Note that {T bad ,T good ,Ti oca i,T de f er } clearly forms a partition of T. The aforementioned 
score function score(T, C) is then defined as score(T, C) = 4\T de f er \ + 7\T [oca i\ + 12\T good \. 
Denoting as above a given partition by L \ d and the generated partition by C new then the 
two aforementioned operations (PI) and (P2) are defined as follows. 

(PI) If A, Be C old and a e A then C new = {A-a,BU {a}} U (C old - {A, B}). 
(P2) If A G Coid with \A\ > 2 and a e A then C new = {A - a, {a}} U [C M - A). 

Armed with these definitions, we are now ready to state the Heuristic move. 

Heuristic move: Starting with C Q i d = {L} we exhaustively search for an operation (PI) 
or (P2) which, when applied to C Q i d , will create a partition C new with score(T, C new ) > 
score(T, C i d ). If no such C new exists then the sought after partition C is C Q i d and we are 
done. Otherwise put C Q i d = C new and repeat. 

Note that the Heuristic move is guaranteed to terminate in polynomial time and that it will 
never return {L} as the final partition [I6JE Also note that the Heuristic move can generate 
partitions with 4 or more blocks (as long as this raises the score). This is in contrast to 
its analog in [16, page 1118] where the number of blocks in a partition is restricted to 3. 
Arguably somewhat unassuming looking this restriction to 3 blocks guaranteed that for any 
triplet set T a level-1 network could be constructed which was consistent with a fraction 
5/12 of the triplets in T [16]. Interestingly - and although the system of inequalities 
underpinning the 5/12 result also holds in the case when there is no restriction on the 
number of blocks, as in this case - a simple level-1 network (the construction of which is 
the purpose of the second phase of LevTathan) is in the worst case consistent with no 
more than « 1/3 of the triplets in a given set. An example in point is a partition where 
each block is of size one. For such a partition the guaranteed lower bound in the worst- 
case tends to 1/3. So removing the restriction to 3 blocks can theoretically undermine 
the 5/12 lower bound of [16J. However, and as suggested by examples, the dropping of 
the 3-block restriction allows in practice the construction of a wider range of networks 
(and network topologies) that are consistent with a higher percentage of triplets (see the 
supplementary data section of |12j). If the user nevertheless requires the 5/12 lower bound 
to be mathematically guaranteed, then this can be ensured by limiting the number of 
blocks to at most 9. 

5. Constructing Simple Level-1 Networks 

In this section, we turn our attention to the second phase of LevIathan which is 
concerned with constructing a simple level-1 network from a triplet set T such that the 
number of triplets in T consistent with that network is maximized. Since this optimization 
problem is NP-hard |T5] in general, we propose to do this heuristically. (We remark that 



To avoid finishing with {L} we allow in the first, and only the first, iteration an operation to be applied 
as long as this does not decrease the score. After this operations are only permitted if they strictly increase 
the score. 
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for small instances LevIathan will also compare the heuristically computed simple level- 
1 network with an optimal tree computed using Wu's exponential-time algorithm [32] • If 
the tree is consistent with at least as many triplets as the simple level-1 network then 
LevIathan will return the tree, parsimony again being the motivation for this. We will 
not discuss this step further). Once again we emphasise that if LevIathan chooses for 
an Aho or JNS move when partitioning the leaf set of a triplet set, then the algorithms 
described in this section will not be used by LevIathan. 

To be able to describe our heuristic we need to generalize our notion of consistency which 
we will do next. Suppose that T is a triplet set and that £ is a partition of the leaf set 
L = L(T) of T. Suppose also that o,i,c6l are distinct elements in L and that Af is a 
network with leaf set £. Then we say that ab\c G T is consistent with Af if there exist 
distinct sets L a , L b , L c G L(Af) with a G L a , b G L b , c G L c such that L a L b \L c is consistent 
with Af. Furthermore, we say that ab\c is inconsistent with Af if ab\c is not consistent 
with Af but there do exist distinct sets L a , L b , L c G L(Af) with a G L a , b G L b , and c G L c . 
Note that, with this slightly generalized definition, it is possible that a triplet in t G T is 
neither consistent nor inconsistent with Af i.e. when t contains two or more leaves that lie 
in the same block of £. 

We are now in the position to briefly outline the second phase of our heuristic. Suppose 
T is a triplet set and £ is a partition of L(T). Then if the cardinality of £ is moderate 
(by default: at most 12) we compute an exact optimal solution in exponential time. This 
exact algorithm is described in Section 15.11 If the cardinality of £ is too big to compute an 
exact optimal solution, we use the greedy algorithm described in Section 15.21 Note that, 
although stated in terms of a triplet set T and a partition £ of L(T) (i.e. the partition 
chosen by the previous step of LevIathan), both algorithms can also be used for general 
leaf- and triplet sets: for an arbitrary set of triplets X" the partition of L{T') consisting of 
only singletons can be taken as £. 

5.1. Exact Algorithm. Van Iersel et al. [3 lj proposed an exponential-time exact algo- 
rithm to find a level-1 network consistent with a maximum number of triplets of a given 
set T in 0(m4 n ) time where m = \T\ and n = \L(T)\. This section describes how this 
running time can be improved to 0(nm2 n ) if an algorithm searches only for simple level-1 
networks. To describe such an algorithm we require two more concepts that concern phy- 
logenetic trees. A phylogenetic tree is said to be a caterpillar if the parents of the leaves 
form a directed path. We say that a caterpillar C ends in leaf r if the sibling of r is also a 
leaf of C (each caterpillar thus ends in exactly two leaves). For example the phylogenetic 
tree depicted in Figure [3Ja) is a caterpillar that ends in leaves x and y. 

Our algorithm, presented in the form of pseudo-code in Algorithm [21 consists of 2 steps 
and takes as input a triplet set T and a partition £ of the leaf set L = L(T) of T. It returns 
an optimal (in a well defined sense) simple level-1 network on £. It works by essentially 
trying each set L r G £ as the leaf below the reticulation of the simple level-1 network to 
be constructed. More precisely and with T and £ as above, the first of its 2 steps is as 
follows. For each set L r G £ we do the following. For each subset Z C £ with L r G Z, 
we first compute an optimal caterpillar Cz ending in L r in the sense that the number of 
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triplets in T consistent with the caterpillar Cz is maximal. To achieve this, note that the 
caterpillar Cz with \Z\ — 2 consists of a root and 2 distinct leaves each of which is the head 
of an arc starting at the root (lines 2-4). To find the other caterpillars, we loop through all 
subsets Z C £ with L r G Z from small to large, starting at the subsets of size three. For 
each such subset of C, we loop through all elements V G Z \ {L r } and create a candidate 
caterpillar by creating a new root and two arcs leaving the root: one to L' and one to the 
root of Cz\{v} (lines 5-9). Among all candidate caterpillars, we then select a caterpillar 
that is consistent with a maximum number of triplets in T. 

Once all the caterpillars have been constructed, an optimal simple level-1 network is 
found in the second step as follows. We loop through all bipartitions {Z, Y} of £ \ {L r }. 
First suppose that Z and Y are both nonempty. Then we combine the caterpillars Czu{L r } 
and Cyu{L r } m to a candidate simple level-1 network N Z; Y,r as follows. Let az = (vz,L r ) 
be the arc entering L r in Czu{L r }, let ay = (vy, L r ) be the arc entering L r in Cyu{L r } and 
let rz,ry denote the roots of Czu{L r } and CVu{L r } respectively. We add a new root r n and 
a new vertex v n and replace arcs az and ay by new arcs (r n , r^), (r n , ry), (vz, v n ), (vy, v n ) 
and (v n ,L r ) (lines 10-14). See Figure H^a) for an example of this construction. 



(a) (b) 




C Z u{L r } CVu{L r } Mzyr <~>zu{L,.} jsf ZYr 

Figure 4. (a) Construction of Af z ,Y,r from C ZU {L r } and Cyu{L r } if Z,Y ^ 0. 
(b) Construction of Nzy> r from Czu{L r } if Y = 0. 



Now suppose that Y = 0. Let again az = (vz,L r ) be the arc entering L r in Czu{L r } 
and let rz be the root of Czu{L r }- We add a new root r n and a new reticulation v n and 
replace arc az by arcs (r n ,r z ), (vz,v n ), (r n ,v n ) and (v n ,L r ). This leads to the candidate 
network Afz,y,r- See Figure H](b) for an example of the construction in this case. Finally, 
we select a network M consistent with a maximum number of input triplets, over all 
constructed candidates and return that network. 
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Algorithm 2 Exact Simple Level-1 Network Construction 
Input : Triplet set T and a partition £ of the leaf set L(T) of T. 

Output : Simple level-1 network M with leaf set £ consistent with a maximum number 
of triplets in T. 
1: for r = I . . . q := \£\ do 
2: for L x e£\ {L r } do 
3: Z <— {L r , L x } 

4: Cz is the caterpillar consisting of a root, the vertices L r and L x , and 2 arcs both 

starting at the root and one ending in L r and the other in L x . 
5: for c = 3, . . . , q do 

6: for each Z O £ with L r E Z and |Z| = c do 
7: for L' EZ\ {L r } do 

8: is the caterpillar consisting of the caterpillar CWfi/}> the leaf Z/, a new 

vertex (the root of C%), and two new arcs both starting at that vertex and 
one ending in V and the other in the root of Cz\{l'} 
9: Cz is the caterpillar C% that is consistent with a maximum number of input 

triplets over all V 
10: for all bipartitions {Z, Y} of £ \ L r do 
11: ifZ,F^0then 

12: combine caterpillars Czu{L r } and Cy\j{L r } into a candidate network Nz,y )T as in 

Figure H^a). 
13: if Y = then 

14: transform caterpillar Czu{L r \ into a candidate network Mz,Y,r as in Figure IU(b). 

15: return a network jV that is consistent with a maximum number of input triplets, over 
all constructed candidates Mzy,r 



A straightforward analysis of the above algorithm implies the following result. 

Theorem 1. Given a triplet set T with m = |T| and n = \L(T)\, a simple level-1 network 
consistent with a maximum number of triplets from T can be constructed in 0(nm2 n ) time. 

We now turn our attention to presenting our greedy algorithm which follows the same 
philosophy as the previous algorithm i.e. try each set in £ as the leaf below the reticulation 
of a simple level-1 network to be constructed. 

5.2. Greedy Algorithm. With T and £ as above, we next give the details of our greedy 
algorithm, which we present in the form of pseudo-code in Algorithm [31 In the context of 
this, it should be noted that a similar strategy was shown to perform particularly well in 
an algorithm by Snir and Rao for the construction of phylogenetic trees from triplets [2B] . 

For each set L r G £, we first create an initial network M with three vertices r, t and L r 
and three arcs (r,t), (r,t) and (t,L r ) (lines 2-4), where the arc (r, t) occurs twice. To this 
network we then add the other elements from £ in a greedy fashion and each time renaming 
the resulting network M . To decide which element Li G £\L{J\f) to insert first, and where 
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to insert it i.e. into which non-trivial arc a of Af, we use a score function s(Lj,a). To 
present this score function suppose that u and v are vertices of Af such that a = (u, v) is 
a non-trivial arc of Af and that Lj G C\ L(N). Let A/"(Lj, a) denote the network obtained 
from Af by removing arc a and adding two vertices Li and w and three arcs (u,w), (w,v) 
and (w, Li) to Af. Then the score s(Lj, a) is equal to the number of triplets t G T with 
L(t) H Lj 7^ that are consistent with Af(Li, a) minus the number of triplets t G T with 
L(t) fl Lj 7^ that are inconsistent with A/"(Lj, a). In other words, 

s(Lj, a) = |{i G T : t is consistent with A/"(Lj, a) and L(i) fl Lj 7^ 0}| 
- \{t G T : t is inconsistent with A/"(Lj, a) and L(i) n Lj 7^ 0}| 

Note that the definition of s(Li,a) does not consider triplets t for which L(t) contains at 
least one element in C that has not yet been added to the network. Also, the definition does 
not consider triplets t that are neither consistent nor inconsistent with Af(Li,a). (This is 
because the role of such triplets in the final network is entirely determined by the choice 
of C and choices made in later recursive steps). 

Suppose L* G C \ L(Af) and a* G A(Af) are such that s(Lj,a) is maximized. Then 
we construct a simple level-1 network Af(L*,a*) and insert the remaining leaves into this 
network by the same method (lines 5-8). Finally and by searching through all constructed 
simple level-1 networks Af r we return that the network Af that maximizes the number of 
triplets in T it is consistent with (lines 9-10). 

The algorithm can thus be summarised as follows. 

Algorithm 3 Greedy Simple Level-1 Network Construction 
Input : Triplet set T and a partition C of the leaf set L(T) of T. 

Output : Simple level-1 network Af on C (that heuristically optimises the number of 
triplets consistent with T). 
1: for r = 1 ... q do 

2: V r ^{r,t,L r } 

3: Ar <- {(r,t),(r,t),(t,L r )} 

4: Af r <- (V r ,A r ) 

5: while L(Af r ) ^ £ do 

6: compute s(Lj, a) for each Lj G C \ L(Af r ) and each nontrivial arc a of A/" r 

7: find L* , a* maximising s(Lj, a) 

8: jV r <— Af(L*, a*) 

9: let f{Af r ) be the number of triplets from T consistent with A/" r 

10: let Af maximise f(Af r ) over all r — 1 ... q. 

11: return A/" 



6. Experiments 

To better understand the behavior of LevIathan, we performed a biologically-motivated 
simulation study using triplet sets consistent with level-1 networks of different size and com- 
plexity and various amounts of missing data (experiment 1) and of noise (experiment 2). 
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To ensure not only variability of the input triplet sets for LevIathan but also consistency 
with a level-1 network, we used a novel algorithm for random level-1 network generation. 
After giving a general outline of our simulation study in the next section, we describe this 
algorithm in Section IST21 To model missing data and noise, rather then using the triplet set 
T(M) induced by such a network M. we used a triplet set T e (A4) as input for LevIathan 
where e is a parameter that governs the amount of missing data/noise in T(Ai). Details on 
the precise definition of T e (A4) will be given in the next section. Using a range of measures 
which we describe in Section 16.31 we present the outcomes of our study in Section 16.3.11 in 
case of missing data and in Section 16.3.21 in case of noise. 

6.1. General outline of our simulation study. Our simulation study consists of two 
experiments each of which is motivated by a situation one might encounter in a triplet 
based phylogenetic network analysis. The full results of (and inputs to) both experiments 
are available in the supplementary data section of [12]. The first experiment (Section 16.3. II 
- missing data) deals with the situation that only some of the triplets induced by some 
unknown level-1 network M. are given. This phenomena is modeled by setting T e (Ai) to 
be a randomly selected subset of T(A4). The second experiment (Section 16.3.21 - noise) 
addresses the situation that the confidence level one might have in the triplets generated 
in a phylogenetic analysis might vary. In our experiment this is modeled by adding noise 
to T(M.). Put differently, we essentially construct T t (Ai) by randomly selecting triplets 
from T(M) and for each such selected triplet t randomly "flipping" it to one of the two 
other possible triplet topologies on L(t). 

For both simulation experiments, the general outline is as follows. We first choose some 
level-1 network subNet as the "seed" for our random level-1 generator algorithm. For the 
generated level-1 network Ai, we then constructed the triplet set T e (Ai) from T(M) and 
use T e (Ai) as input for LevIathan. The level-1 network M generated by LevIathan we 
then compared against A4 with regards to topology and also the four measures described in 
Section I6T31 In each experiment we used a total of 110 randomly generated level-1 networks 
with leaf set size ranging between 22 and 115 and number of reticulations ranging between 
1 and 10. The running time of LevIathan on a standard desktop computer ranged from 
2-3 seconds for the simplest networks to 30 seconds for the most complex ones. 

6.2. Generating random level-1 networks. A survey of the literature suggested the 
NetGen [19] program to be the only available approach for systematically generating net- 
works. Whilst NetGen addresses the issues of size (i.e. number of vertices) and network 
complexity one would encounter when manually constructing networks, one of its main 
drawbacks is the lack of guarantee that the generated network is level-1 (although some 
happen to be level-1 networks). One way to overcome this problem is to hand pick suit- 
able networks from a (relatively small) subset of generated networks. Alternatively the list 
L>NetGen of networks generated by NetGen could be post-processed by manually removing a 
sufficient number of reticulations from each network in L^etGen to obtain a level-1 network. 
A further and maybe more important drawback of NetGen is that the structure of a gall in 
a generated network is rather simple in the sense that its size is 4. We therefore developed 
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our own algorithm for generating level-1 networks. This algorithm is implemented in Java 
and freely available for download from [T2]. We next describe this algorithm. 

Our algorithm for generating level-1 networks takes as input a level-1 network subNet 
on a fixed number m of leaves plus a positive integer n and outputs a level-1 network A4 
on a larger leaf set. A pseudo-code form of the algorithm is presented in Algorithm HI 



Algorithm 4 Level-1 network generator 

Input : A simple level-1 network subNet and a positive integer n. 
Output : Level-1 network A4 
1: E <- 

2: M. <— empty graph 

3: for i = 1 ... n do 

4: iVj <— generate instance of subNet 

5: E <- E U N { 

6: for all Ni G E do 

7: relabel the leaves in A, such that no two networks in E have the same leaf set, 
8: randomly choose some integer I in {0, 1, ... , |~|V(sMWVe£)|/4~|} 
9: randomly delete / vertices from Ni 

10: M <- Ai 

11: for z = 2 ... n do 

12: randomly choose a leaf of and replace it with the network iV, 
13: p <— number of leaves in KA 

14: randomly choose some integer I' in {0, 1, . . . , [p/2]} 

15: randomly delete V leaves and cherries from M. where, as usual, a cherry of a network 

Af' is a set of leaves a and b of M' such that the parent of a is also the parent of b. 
16: return M. 



We remark that the size of the gall C in subNet clearly influences the variability of 
networks generated by this algorithm. Also, we remark that for the purpose of the discussed 
simulation study subNet was the level-1 network depicted in Figure [5]^a). 
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Figure 5. The figure illustrates the generation of a level-1 network via our 
level-1 network generator algorithm (c. f. Algorithm HJ). The initial network 
subNet is depicted in (a). In the discussed example, m = 6 for the network 
pictured in (a). The networks presented in (b) and (c) are obtained by 
randomly removing vertices from two instances of the network in (a). The 
network depicted in (d) is the network Ai generated by replacing leaf a 5 in 
the network pictured in (b) with the network shown in (c). 

Algorithm H] starts by generating Ni . . . N n instances of subNet and storing them in set 
£ (lines 3-5). Next (lines 6-9), for each iV* G E we first randomly chose an integer I 
with < I < \\V {subN et)\ / A\ and then randomly delete I vertices v q , 1 < q < I from iVj 
making sure that no deleted vertex is the root or a leaf of Ni. If v q is a reticulation then we 
randomly choose one of the parents of v q , say p q , and add a new arc from p q to the unique 
child of v q . For any other deleted vertex v q we reconnect its unique parent with one of its 
children. The child to be reconnected is selected randomly, but a child that is not a leaf 
is always preferred over a child that is a leaf. We initialize our output level-1 network Ai 
with the network N\ (line 10) and then iterate over Ni, where i = 2 . . . n. At each iteration 
we randomly select a leaf from network Ai and replace it by Ni (line 12) yielding a new 
level-1 network Ai. Finally, we randomly remove a random number of leaves and cherries 
from Ai (lines 13-15) and then return the resulting network which we also call Ai. We 
conclude the description of Algorithm H] by making the trivial observation that the size of 
Ai depends on the number n of subNets. 

To illustrate the inner workings of the level-1 network generator algorithm suppose 
subNet is the level-1 network with leaf set {ai, . . . , a m } depicted in Figure 0(a) with m = 6 
and suppose that n = 2. Then we first generate 2 instances of that network. The deletion 
of vertices from each of that networks as described in line 9 of Algorithm H] results in 
the networks depicted in Figures [5(b) and[^c). Next, leaf 05 in the network depicted in 
Figure E(b) is replaced with the network pictured in Figure [3(c). The resulting network 
Ai is depicted in Figure [5(d). Note that due to the small number of leaves of subNet, the 
operation of removing random leaves and cherries from that network (lines 13-15) is not 
executed. 
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6.3. Measures. Reflecting the fact that our simulation study is aimed at assessing the 
reconstructive power of LevIathan by measuring the similarity between a level-1 net- 
work Mi and the level-1 network Mm,c generated by LevIathan when given T e (Ml), we 
considered 4 different measures. To be able to describe these measures, we need to fix 
some notation first. For the remainder of this section and unless stated otherwise, let Mi 
be a level-1 network and let Mm '■= Mm,£ denote the network that LevIathan generates 
when given T e (Ml) as input. 

The first measure is the network-network triplet consistency measure. For Mi and Mm we 
define the network- network triplet consistency measure C(Mi, T € (MI),Mm) as the quantity 

C { M,TAM),M„)^ (M)nT > (M)nT(MM 



\T{M)nT e (M)\ 

Loosely speaking, C(Ml, T e (Ml), Mm) is the proportion of "definitely correct" triplets (i.e. 
T(Mt) n T e (jH)) that are consistent with Mm- Thus C(Mi, T 6 (Mt), Mm) is always a real 
number between and 1. A variant of this, the triplet-network triplet consistency measure 
C(T,V), pays less heed to the origin or accuracy of the input triplets and is defined for an 
arbitrary triplet set T and a phylogenetic network V by putting 



C(T, V) 



\TnT{V)\ 

irj 



In other words, C(T,V) is the fraction of triplets in T that is consistent with the network 
V. Note that this measure is different from the triplets distance introduced in [6]. 

The third of our triplet based measures is inspired by the quartet distance for unrooted 
phylogenetic trees [5] and is called the network-network symmetric difference. For Mi and 
Mm the network- network symmetric difference S(A4,Mm) between Ml and Mm is defined 
as the quantity 

S(M,M M ) = \T(M)AT(M M )\- 

S(MI,Mm) is thus the number of triplets that appear in T(Mi) but not in T{Mm)i or vice- 
versa. Note that in this measure Mi and Mm are compared with regards to their induced 
triplet sets, while Mm was generated in response to the set T e {Mi) derived from T(Ml). 
As already noted above for the network-network triplet consistency measure, the network- 
network symmetric difference measure also suggests a natural variant of it, S(T,M), defined 
for an arbitrary triplet set T and a network M by putting S(T,M) = \TAT(M)\. 

Regarding the S(T,M) and C(T,M) measures it should be noted that the former might 
be more powerful than the latter. To see why consider again the example of the triplet 
set T(M) induced by the caterpillar M on X = {x,y, z,g, /} depicted in Figure E](a). As 
already pointed out earlier, the level-1 network M' on X depicted in Figure El^b) is also 
consistent with T(M) and no arc of M' can be deleted from M so that consistency with 
T(M) is preserved. If we take T(M) = T e (M), then (in the context of network-network 
triplet consistency) M and M' are equally good level-1 networks for representing T{M) 
since T(M) C T(M'). However, under the network-network symmetric difference measure 
M would be preferable over M', because of precisely that proper subset relationship. 
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The final measure we considered is the /^-distance which was originally introduced 
in [7]. To define this measure which was shown in [51 [7J to be a metric for a certain 
class of networks (i.e. tree-child networks), which includes the class of level- 1 networks 
as a subclass, we require some more notation. Suppose M is a phylogenetic network on 
some set X = {l,...,n}, n > 1, and v is a vertex of M. Then the vector fi_\f(v) = 
(mi(v), m 2 (f ), . . . ,m n (v)) can be associated to v where for all i e X the quantity rrii(v) 
represents the number of different paths in Af from v to leaf i. With denoting by fi(Af) the 
multiset of all vectors fi_\f(v), v a vertex of Af, and calling //(A/*) the [i-representation of Af, 
the //-distance d^Mi, Af-z) between any two phylogenetic networks Af\ and Mq, is defined as 

d fl (M 1 ,M 2 ) = \ t i(M 1 )A t i(M 2 )\ 

where the symmetric difference operator is defined here over multisets. 

Armed with these measures and our algorithm for randomly generating level- 1 networks, 
we are now ready to describe the results of our simulation study. (We note that, because 
the source networks used in the simulation had no vertices of outdegree 3 or higher, and for 
technical reasons, the optional post-processing phase used by Leviathan was switched 
off during these simulations). Assume then from now on that A4 is a level-1 network 
generated by our random level-1 network generator described in Algorithm H] and that the 
definition of the network Mm is as before. We start with presenting our results for the 
missing data experiment. 

6.3.1. Simulation results - missing data experiment. Central to this experiment is the pa- 
rameter e which indicates the probability that a triplet in T(A4) will be included in T e (A4). 
The values of e that we investigated were e = 1.0 (i.e. all triplets in T(A4)), 0.9, . . . ,0.4. 

Modulo a well-understood exception (described below) all networks A4 generated via 
Algorithm H] were recovered correctly by LevIathan when e = 1.0. This is a conse- 
quence of the fact that LevIathan prioritises JNS moves and that a level-1 network A4 
is completely defined by T(A4) up to, but not including, galls of size 4 [TO]. This is a 
drawback of any triplet based phylogenetic network approach since such approaches have 
to make a choice between the three galls on a set X ={a, b, c} that are all consistent with 
T = {ab\c, a\bc}. This inability to distinguish between the topologies of size-4 galls does 
not, of course, affect the triplet-based measures, and for e = 1.0 we had indeed in all cases 
that S(M,M M ) = and C (M , T € (M) , M M ) = 1- When LevIathan correctly guessed 
the topologies of all size-4 galls in a network A4 we additionally had d ll {AA.,AfM) = 0, but 
this value became non-zero when the guess was incorrect. 

For all other values of e and all networks A4, we frequently observed that - although 
similar when inspected visually - some of the galls from A4 were not reconstructed correctly 
in the generated level-1 network Mm, in the sense that the size of a gall in Mm was 
smaller than in the corresponding gall in A4 (e.g. a size 5 gall in A4 became size 4 gall 
in Mm, see Figure |6]). Although this observation is clearly dependent on the specific value 
used for e, it generally meant that one or more vertices had been pushed out of a gall 
in A4 to its sides, causing what we will call typical gall damage for A4. In turn this 
means that subnetworks hanging from galls are sometimes merged by LevIathan into 
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a single subnetwork. Even in the presence of typical gall damage, however, LevIathan 
sometimes (but not always) correctly inferred which leaf of Ai needed to be placed below 
the reticulation of the damaged gall. Interestingly, we also observed that typical gall 
damage was rare for galls G - even for low vales of e - if the subnetworks hanging from G 
contain many leaves. Expressed differently, the likelihood that a gall in Ai suffers typical 
gall damage for e < 1.0 is higher if the gall G is closer to the leaves. The reason for this 
might be that G is supported by far fewer triplets then a gall closer to the root. 

It should be noted that the exception to the former of the last two observations is the 
galled caterpillar network [U[T6] which can be thought of as a natural level-1 generalization 
of a caterpillar. Such networks Ai were correctly reconstructed by LevIathan for e > 0.8. 




c c 

(a) (b) 



Figure 6. The figure illustrates typical gall damage, (a) The level-1 net- 
work Ai on X = {a, . . . , d} with a gall G of size 5. (b) The level-1 network 
Mm on the same set with the erroneous reconstruction of G in terms of the 
gall G\ 

In addition, we observed for all networks Ai that when e dropped from 1.0 to 0.9, the 
network-network triplet consistency between Ai and Mm a l so drops slightly (often from 1 
to a value in the range 0.95 - 0.99) but immediately stabilises around that value, even for 
very low values of e. This phenomenon even occurs when there are very few or even no Aho 
or JNS moves occurring, suggesting that Heuristic moves are (in this context) good for sus- 
taining a very high value of triplet consistency. However, and for all values of e other than 
1.0 and all networks Ai, we also observed that even when C(Ai, T e (Ai), Mm) is ver Y close 
to 1, Ai is often not recovered correctly by LevIathan from T e (Ai). An example that 
illustrates this point is the triplet set T(M) of the level-1 network M = Mm depicted in 
Figure HD^b). This triplet set together with the triplet bc\a is the triplet set T(Ai) induced 
by the level-1 network Ai depicted in Figure E](a). The network generated by LevIathan 
from T t (Ai) = T(M) - {bc\a} is the level-1 network M and C(M,T e (M),M) = 1. Thus, 
rather than being a suitable tool for assessing LevIathan's reconstructive power, the 
C(Ai, T e (Ai), Mm) measure might be of limited use for capturing differences between net- 
works. 

For all values of e and averaged over all networks Ai for e, we observed an initial sharp 
rise for both the network-network symmetric difference and the /i-distance when e drops 
from 1.0 to 0.9. Again in both cases this initial rise is followed by a a much slower growth 
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rate although this rate seems to be higher for the /z-distance. Encouragingly, we found 
instances of networks M. where for e > 0.8 LevIathan correctly inferred the missing 
triplet information, that is, correctly reconstructed Ml from T £ (Ai). 



6.3.2. Simulation results - noise experiment. Central to this experiment is again the pa- 
rameter e which in this case is an error probability and which we state in terms of values 
between and 1. Its purpose is to introduce noise into T(Mt) and the range we considered 
was 0.00, 0.01, 0.02 ... 0.10 and 0.10, 0.15, . . . , 0.50. More precisely, we generated T t (M) 
from T(Mt) by taking each triplet t G T(M) and with probability e decide to corrupt 
it, that is, replace t in T(M) with one of the 2 other phylogenetic trees on L(t), chosen 
uniformly at random. 

We observed that even for very low error probability values, the triplet set T e (M) is 
unlikely to be consistent with a level- 1 network. It is thus not surprising that we often 
observed additional (erroneous) reticulations in networks reconstructed by LevIathan 
from T e (Mi). Nevertheless and based on visual inspection, our experiment seems to indicate 
that even for slightly higher values of e, i. e e = 0.05, 0.06, . . . , 0.15, and all networks M., 
much of the structure of Ai is recovered correctly by LevIathan from T t (M). Having 
said that, and in addition to the above mentioned erroneous reticulations, typical gall 
damage is common in the generated networks. 

For all e other than 0.00 and averaged over all networks M. for e, we observed that, as 
expected, C (T e (J\4) , Mm) decreased linearly with increasing error probability in the sense 
that C(T e (Ml), Mm) ~ 1 — e holds. Reassuringly (and by no means obviously) this al- 
most linear relationship is not obeyed by the network-network triplet consistency measure 
C (Mt , T £ (M) , Mm) • By averaging for each error probability value over all networks Ml, we 
summarize our findings for that measure in Figure [7] in terms of plotting the error prob- 
ability values for e against the corresponding network-network triplet consistency values. 
Intriguingly, there is an initial 2% drop after which network-network triplet consistency 
remains high until error rates reach values of e > 0.30. It should be noted that this in fact 
corresponds to a high level of noise in T(Ml) given that if Ml is a phylogenetic tree and 
e = 0.66 then T e (Mt) is a completely randomized triplet set and all structural information 
contained in T e (Ml) concerning Mi has been lost. The most likely explanation for the initial 
2% drop is probably (again) typical gall damage since, as alluded to above, a given gall 
represents more triplets then its a gall-damaged counterpart. The fact that after the initial 
drop network-network triplet consistency remains high, is encouraging, because it shows 
that LevIathan holds promise for reconstructing triplets that have not been corrupted 
by noise, up to quite a high level of noise. 
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Figure 7. In the noise experiment the C(M.,T e (M),Afj^) measure expe- 
riences an initial sharp drop of about two percent, after which it remains 
relatively stable until the error probability becomes large (bigger than 0.3). 

For all values of e and averaged over all networks Ml for e, we also observed an initial 
sharp increase when e increases form to 0.01 for both the network- network symmetric 
difference and the /i-difference. After this, the values for both measure grow very slowly 
with the growth rate for the //-distance being higher until e reaches 0.3 when the growth 
rate of the network-network symmetric difference takes over. Intriguingly the network- 
network symmetric difference seems to reach a peak at e = 0.4 and then drops of again for 
e = 0.45. However and in the light of the fact that, as pointed out above, 0.3 is already a 
high error probability the later 2 observations should be treated with caution. A possible 
reason for the above mentioned initial sharp rise might be that even one corrupted triplet 
can potentially introduce erroneous (additional) galls in Mm- Combined with the problem 
of typical gall damage, this can greatly affect the triplet sets induced by Ml and Mm as 
well as their //-representations and thus the network-network symmetric difference and the 
/i-distance which rely on these concepts respectively. 

Interestingly we identified some networks M. such that when e = 0.01 (and addition- 
ally T t (M) ^ T(M)) we nevertheless had that (I^MMm) = = S(M,M M )- Visual 
inspection revealed that in such cases Ml was equal to Mm- This suggests that if the noise 
level in an input triplet set is small enough LevIathan might still be able to correctly 
reconstruct the level-1 network that induced that triplet set. 

6.4. A HIV-1 virus data set. To illustrate the applicability of our approach, we re- 
analyzed a HIV-1 virus data set that originally appeared in [211 Chapter 14]. All but one 
of the sequences (KAL153) making up that data set are 50 percent consensus sequences 
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Figure 8. The phylogenetic trees for the three sub-alignments with the out- 
group C omitted in each case, (a) sub-alignment 1-2699, (b) sub-alignment 
2700- 8925, and (c) sub-alignment 8926-9953. 

for the HIV-1 M-group subtypes A, . . . ,D, F,G, H, and J with the KAL153 strain being 
thought to be a recombinant of subtypes A and B. Recombinants such as KAL153 can 
essentially be thought of as having arisen via the transfer and integration of genetic material 
from one evolutionary lineage into another. The positions in an existing sequence where 
the foreign genetic material was integrated are generally called breakpoints and many 
approaches for detecting recombination aim to identify these breakpoints. 

For the above data set two breakpoints were identified (positions 2700 and 8926) in [211 
Chapter 14]. Furthermore, for the three induced sub-alignments 1-2699, 2700- 8925, and 
8926-9953 the Neighbor Joining method [20] (with subtype C as outgroup and the K2P 
model [18] ) was used to represent the data in terms of arc- weighted phylogenetic trees (see 
[21j page 159] for a depiction of those trees). Since the resolution patterns for J and G 
in the first tree, H and C in the second, and G and J in the third tree was not clear, we 
recomputed those trees using the above settings. Reassuringly and when arc weights were 
ignored, this resulted in the same phylogenetic trees for the first and third sub-alignment as 
in the previous analysis except that the unresolved vertex in each tree was now resolved. 
For the second sub- alignment, the same tree was obtained. For the convenience of the 
reader, we depict these phylogenetic trees in Figure El 

As expected, the position of the KALI 53 strain is the same in the first and third topology 
but different in the second topology. However and somewhat surprisingly the resolution 
order of subtypes G and J is in the first and third tree is different as well as the place- 
ment of subtype F in the tree. Having said this, the branch weights that support these 
different resolution orders are very small. To therefore not allow this to overly influence 
our analysis (after all our method as well as the other two methods that we tried out are 
using phylogenetic trees rather than arc-weighted phylogenetic trees as input and therefore 
these different levels of support are not taken into account), we only used the first and the 
second and the third and second tree as respective inputs for our analysis. 

Interestingly the second phylogenetic tree postulates the triplet FB\A on subtypes A, F 
and B whereas the first tree hypotheses AF\B. Since this conflict also interferes with the 
conflicting information for subtypes A, B and KAL153, attempting to combine the triplet 
set generated from both phylogenetic trees into a general level-1 network is problematic 
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Figure 9. The optimal simple level-1 network constructed by LevIathan 
for the first and second tree in Figure [HJ For each taxa x the percentage 
value denotes the percentage of input triplets containing x that the network 
is consistent with. This can be interpreted as how "satisfied" the taxa in 
question is with its location in the network. The value on each arc is the 
deletion support value for that arc i.e. the number of input triplets that 
would cease to be consistent with the generated network if that arc was 
deleted. 



as such a network would postulate 2 recombination events for this data set. To avoid this 
and at the same time identify the stronger of the two conflicting signals it is therefore 
more useful to construct (using Algorithm [2]) an optimal simple level-1 network, which by 
definition has only one reticulation vertex. This type of network is depicted for the first 
and second tree from Figure [H] in Figure As expected, the network correctly identifies 
the KALlb?) strain as a recombinant of subtypes A and B. 

It should be noted that repeating this analysis for the second and third tree in Figure [HJ 
resulted in the same simple level-1 network as the one depicted in Figure [9] except that the 
order of G and J was reversed. The same holds when the optimal simple level-1 network 
is computed for the respective original phylogenetic trees from [2TI Chapter 14] when 
ignoring arc-weights. Interestingly and in case of the second and the third tree, exclusion 
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of subgroup F also resulted in a simple level-1 network that correctly identified the KAL153 
as recombinant. However this was not the case when this analysis was repeated for the 
first and second tree. 

We conclude this section with remarking that although using different philosophies, the 
other two known approaches i.e. Cluster networks and Galled networks (both of which 
are implemented in Dendroscope [TJ]) also had problems with this data set, postulating 
between 2 and 4 recombination events (data not shown). 



7. Conclusions 

In this paper, we have presented a heuristic for constructing level-1 networks from triplet 
sets which we have also implemented in Java as the freely available program LevIathan 
[12] . Guided by the principle of triplet consistency, our heuristic aims to optimize a well- 
defined objective function on triplet sets without generating pessimistically complex net- 
works for them. By running in polynomial time and always returning a level-1 network, it 
addresses several of the problems that frequently occur with existing network algorithms 
from such input sets. Using both a biologically motivated simulation study and a biological 
data set, we have also explored LevIathan's applicability to real biological data. 

Based on the outcomes of our simulation study, it appears that our heuristic is able 
to recover, in terms of the triplet set induced by the generated level-1 network, a high 
percentage of the input triplets that were also present in the original network (as opposed 
to triplets that were missing or that had been corrupted). This is nota bene also true 
when the input triplet set is not dense, which (in light of the NP-hardness of the non-dense 
case) is an encouraging observation. On the other hand (and probably not surprisingly 
as our heuristic tries to reconstruct a global structure from local information) it seems 
that it is more vulnerable to noise in an input triplet set than missing data. The most 
probable reason for this is that the former type of problem can sometimes be rectified via 
implicitly inferred triplet information, whereas the latter type of problem has the capacity 
to actively mislead. Having said that, LevIathan shows encouraging potential if the 
amount of missing data is low or there is only very little noise in the data. 

In general using more of the triplets induced by a network as input for LevIathan allows 
larger regions of that network to be recovered by it. Having said that, when confronted with 
missing data or noise, even extremely high values of network-network triplet consistency 
(e.g. above 0.95) do not preclude non-trivial differences between the original network and 
the network generated by LevIathan. Additionally, the noisier a input triplet set for 
our heuristic is, the greater the chance that the network found by it is distorted (e.g. 
through the emergence of surplus galls in the generated network). To tackle this problem 
LevIathan has the option to label each arc of the generated network with its deletion 
support value, see Figure [9j This allows the user to explore which reticulations are weakly 
supported, and thus might be superfluous or even artefacts of our heuristic. 

Although the four triplet-based measures used in this article appear to be very natural 
they only seem to be of limited use for capturing network differences in general. However 
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some of them helped to identify cases where a network generated by LevIathan from 
T e (A4) coincided with the original network M.. 

Our re-analysis of a biological data set from [21] using LevIathan indicated that this 
data set is more complex than it appears at first sight. The resulting conflicting triplet 
infomation misled LevIathan (and also the other two network construction approaches 
that we tried) to postulate a too complex evolutionary scenario when using its default 
option of generating a level-1 network. However LEVlATHAN's simple level-1 network 
option had no problem with this data set and was able to correctly reconstruct the data 
set's widely accepted evolutionary scenario. 

To understand better how well LevIathan performs, it will be necessary to compare it 
directly with an alternative method for network construction that uses similar input and 
also produces a level-1 network. Such methods are lacking at the moment. Similarly, it is 
at the moment difficult to draw conclusions regarding the biological meaning of measures 
such as the /z-distance, which we used in our simulation study. Comparison should also 
be made between LevIathan and other programs when the input is strictly simpler, or 
strictly more complex, than level-1 networks. Figures [TOl and [Til already provide interesting 
insights. Figure [10] shows a level-2 network created by the Level2 algorithm of [28] which 
is consistent with all 1330 triplets in the yeast dataset discussed in that same article. For 
the same dataset LevIathan constructs the level-1 network in Figure [Til this is consistent 
with 94.28% of the 1330 triplets. Both networks cluster the taxa together similarly; the 
main difference is in the LevIathan network taxa 12 and 5 have been pushed further 
away from the root, whilst taxon 8 has risen closer to the root. However which one of 
these two networks is biologically more relevant is not immediately clear. In any case it 
should once again be noted that LevIathan is in many regards much more flexible than 
the Level2 algorithm (and related algorithms such as [26], [29], [30]) because LevIathan 
always quickly returns a network and it does not require the input triplets to be dense or 
fully consistent with the output network. (In particular, the authors of [28] had to use 
time-consuming brute-force techniques to first find a subset of the taxa that induced a 
triplet set fully consistent with a level-2 network). 

It is also necessary to look more deeply at the underlying mathematics of LevIathan. 
For example, the partitioning strategy at its heart (i.e. Phase 1) is a modification of a 
strategy that was originally optimized for worst-case performance, not average-case per- 
formance. Yet this strategy also seems to perform surprisingly well in the average case. 
Understanding why this is, and developing a better appreciation for the theoretical limits 
and strengths of triplet methods as a mechanism for reconstructing network topologies, 
remains a fascinating area of research. 
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Figure 10. A level-2 network on {1, ... , 21} constructed by the LEVEL2 
algorithm [28] for a blinded yeast dataset also described in that paper. This 
network is consistent with all 1330 triplets in the dataset. 
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Figure 11. The level-1 network constructed by LevIathan when given 
the same dataset as discussed in Figure [TU1 This network is consistent with 
94.28% of the 1330 input triplets. 
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